Le Xue

Future Optical Flow Prediction Improves Robot Control & Video Generation

Jan 15, 2026

Enabling Ultra-Fast Cardiovascular Imaging Across Heterogeneous Clinical Environments with a Generalist Foundation Model and Multimodal Database

Dec 25, 2025

Robotic VLA Benefits from Joint Learning with Motion Image Diffusion

Dec 19, 2025

PET2Rep: Towards Vision-Language Model-Driven Automated Radiology Report Generation for Positron Emission Tomography

Aug 06, 2025

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

May 14, 2025

SemiSAM+: Rethinking Semi-Supervised Medical Image Segmentation in the Era of Foundation Models

Feb 28, 2025

SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images

Feb 20, 2025

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Dec 09, 2024

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

Nov 12, 2024

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Oct 21, 2024